For the Spring 2023 semester of STAT 370 Data Science Consulting, we worked closely with the Senior Data Scientist of a major food distribution company to create a model of daily productivity in their warehouses. Given productivity metrics data, we transformed the data, analyzed it, and selected the features that contributed to the daily productivity rate called “cases-per-man-hour”. We built multiple ensemble models and optimized them using grid searches. To better analyze and compare model performance and data trends, we also created and implemented Plotly visualizations.
Maximizing warehouse productivity requires supplying a wide variety of products to meet demand. Along with physical warehouse capacity, the company needs to know how much labor is required, and labor demand is estimated from productivity metrics. The goal of this project is to create a daily productivity model for warehouses.
Warehouses are the central hub for distributing goods to their partners, as they are in charge of packing, processing, and shipping cases. These locations all measure different metrics, such as the number of cases processed, the number of hours worked by specific work groups, the different product zones and the number of items selected from each zone, and many others. To measure daily productivity, we focus on the variable “cases-per-man-hour” (CPMH). This rate is calculated by dividing the total number of cases processed by a warehouse by the total number of labor hours worked. The higher this rate, the more cases are processed per hour, which ultimately translates to higher productivity. We want to determine the factors that help predict and maximize this rate in order to accurately track and coordinate labor within warehouses.
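In formula terms, CPMH for one warehouse-day is simply total cases divided by total labor hours; a minimal sketch with illustrative numbers:

```python
# Cases-per-man-hour (CPMH): total cases processed divided by total
# labor hours for one warehouse on one day. Numbers are illustrative.
def cases_per_man_hour(total_cases: float, total_hours: float) -> float:
    return total_cases / total_hours

print(cases_per_man_hour(12_000, 80))  # 150.0 cases per man-hour
```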
The data spans a period of roughly four years, from 2019 to early 2023. There are 70 warehouses represented across three separate datasets. Before getting into the datasets specifically, it is important to clarify some warehouse terms that are not generally known. Firstly, a SKU, or stock-keeping unit, is a specific product identifier that is used by the manufacturer to track and manage inventory [1]. An example of a specific SKU could be 2% milk or fat-free cottage cheese. Eaches are related to SKUs, as an each refers to any individual item in a warehouse: a gallon of milk is an each, and a carton of eggs is an each [2]. A case, meanwhile, is a container of eaches that is either stored or sent to a consumer from the warehouse [2]. Generally, a case contains only eaches of the same SKU, but if a case is being sent to a small-scale consumer who does not need enough of a product to fill a case, the case that is sent to them may contain eaches from multiple SKUs. A small local bakery and a 2,000-student high school operate at very different scales in their ordering needs and quantities. With this terminology established, the features of the datasets should be less confusing and ambiguous.
The first dataset, titled hours, contained information about the number of hours worked and cases processed per day, per warehouse, and per cohort. Each observation records the hours worked and cases processed by a specific cohort at a specific warehouse on a specific day. Our client informed us that each warehouse worker is sorted into a cohort based on skill level and experience. There are three cohort levels: A, B, and C. Cohort A contains the most skilled and experienced workers, Cohort B the next most, and Cohort C the least. Concerning our specific task, our client told us to expect higher productivity from the more experienced cohorts than from the least experienced one. (See Section Preprocessing.)

The next dataset, titled cases, included information about the number of eaches processed per warehouse, per zone, and per day. The zones are dry, frozen, and cooled, and correspond to where each item is stored (i.e., milk would be in the cooled zone while ice cream would be in the frozen zone).

Finally, our last dataset, titled sae, included data relating to the labeling system of each warehouse. This dataset has three variables: warehouse, go-live date, and label type. Before its go-live date, a warehouse used a system of calling out the SKU of each product to coworkers so they would know which eaches to pick. After the go-live date, a new system was implemented in which each SKU was given a unique barcode that tells the warehouse worker which item to pick to pack a case. The label type variable indicates which specific type of barcode system was implemented in each warehouse. Our client gave us this dataset because, for about a four-week period after the go-live date, we should expect a significant drop in productivity in each warehouse.
It is important to note that each warehouse has a different go-live date.
We initially wanted to use all of the data available to us to predict the productivity rate, but after doing some exploratory data analysis and speaking with our client, we unfortunately had to cut most of the data that was given to us and all of the data we had collected externally. Initially, we planned to use the cases and eaches information to predict a productivity metric representing how much “stuff” was getting processed on a particular day in a particular warehouse. However, the cases dataset (the one with the information about eaches) did not contain any information about the number of hours worked. There was also no cohort information present that could have allowed us to match observations to the cohorts in the hours dataset. After speaking with our client, we concluded that with the data we had, there was no way to accurately determine how many hours were worked for the observations in the cases dataset.
We also wanted to include information about the labeling systems in our models, as we were told it would significantly affect productivity. However, we did not have enough time to incorporate this data into our models. Although this information is not in our models, we show later how productivity and our predictions of productivity are affected by these labeling changes (see Figure 10: Productivity drop after Label Change for an example of this decline).
Therefore, the data we were left with was entirely from the hours
dataset and included date, cohort, and branch information, which will
all be used as our primary predictors in our modeling. With our main
goal of creating a model that best predicts the cases-per-man-hour rate
of a warehouse, our data required some transformations to create our
response variable, as well as other variables associated with measuring
that response. To calculate cases-per-man-hour, we grouped all
observations by date and warehouse, then we summed all of the hours and
cases processed by each cohort. We then created the response
variable by dividing the total number of cases by the total number of hours. This new
response variable, known as consump, represents the case
productivity rate for all cohorts during a specific day in a specific
warehouse. To prevent the loss of the cohort information, we created new
variables that would measure the percentage of hours worked and cases
processed by each cohort. This was done by taking each cohort's original hour and
case numbers and dividing them by the new totals. Finally, our last step
was to convert the dates into more usable formats, which was done by
extracting the specific day of the week and month from each date and
adding them as separate variables. We initially did additional
“wrangling” on our dataset, but since it is not relevant to our
final dataset, it is not described in this report. (It is still
included in our code, as seen in Appendix B.) Our final dataset includes
our response, cohort information (% of hours and cases worked), date
information (both weekday and month), and branch information.
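The transformations above can be sketched in pandas as follows; the toy rows and values are illustrative stand-ins, with column names following Appendix B:

```python
import pandas as pd

# Toy rows shaped like the hours dataset (column names as in Appendix B).
hours = pd.DataFrame({
    "Date": pd.to_datetime(["2021-03-01"] * 3),
    "BRNCH_CD": ["W01"] * 3,
    "COHORT": ["A", "B", "C"],
    "CASES_SELECTED": [600, 300, 100],
    "MEASURED_DIRECT_HRS": [4.0, 3.0, 3.0],
})

# Per-day, per-warehouse totals across all cohorts.
hours["Total_Hours"] = hours.groupby(["Date", "BRNCH_CD"])["MEASURED_DIRECT_HRS"].transform("sum")
hours["Total_Cases"] = hours.groupby(["Date", "BRNCH_CD"])["CASES_SELECTED"].transform("sum")

# Response variable: cases-per-man-hour for the whole warehouse-day.
hours["consump"] = hours["Total_Cases"] / hours["Total_Hours"]

# Per-cohort shares of hours and cases, so cohort information survives.
hours["Hrs_Pct"] = hours["MEASURED_DIRECT_HRS"] / hours["Total_Hours"]
hours["Cases_Pct"] = hours["CASES_SELECTED"] / hours["Total_Cases"]

# Date features: weekday and month extracted as separate variables.
hours["weekday"] = hours["Date"].dt.day_name()
hours["month"] = hours["Date"].dt.month
```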
After we finished “wrangling” our data, we performed some analysis on it and found alarming trends that needed to be addressed before we moved on to model building. The first of these problems was the extreme outliers in our response variable: productivity rates ranged from zero cases-per-man-hour all the way to over 200,000 (see Figure 1: Histogram of Outliers in the Response Variable). After conferring with our client, we learned that the maximum cases-per-man-hour should be around 300. We also saw a few thousand observations with fewer than ten total hours worked (see Figure 2: Distribution of Hours Pre-Wrangling). This is when we came to realize how inconsistent the record-keeping behind our data was. We learned from our client that there was no standard method by which hours and cases were recorded in the warehouses, which means there is no certainty that all of our data is accurate. This led us to the conclusion that we should drop all observations that were not feasible or realistic, which we defined as all cases-per-man-hour rates above 300 and all totals of hours worked below 10. (These numbers were approved by our client as reasonable cutoffs for our data.) This dropped about 4,000 observations but significantly reduced the outliers present in our data (see Figure 3: Post-Wrangling Key Variable Distributions).
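The cutoff filtering can be sketched as below, assuming the response and total-hours columns are named `consump` and `Total_Hours` as in our wrangled data; the rows are illustrative:

```python
import pandas as pd

# Toy data with one infeasible rate and one tiny-hours day (illustrative).
df = pd.DataFrame({
    "consump":     [95.0, 150.0, 4200.0, 80.0],
    "Total_Hours": [120.0, 95.0,   60.0,  6.0],
})

# Client-approved cutoffs: a CPMH above 300 or under 10 total hours is
# treated as a record-keeping error and dropped.
clean = df[(df["consump"] <= 300) & (df["Total_Hours"] >= 10)]
print(len(clean))  # 2 rows survive
```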
During the analysis described above, we also found some interesting information that helped inform how we proceeded with model building. The first finding confirmed our client's assertion that we should see a difference in productivity rates between the cohorts: cohorts A and B are much more productive than C (see Figure 4: Difference in Cases-per-Man-Hour by Cohort). Another interesting discovery concerned the weekdays on which the warehouses are open. As we can see from Figure 5: Number of Observations by Weekday, both Saturday and Sunday have hardly any observations, and Friday has significantly fewer than the other days as well. This suggests that the warehouses are almost universally closed on Sundays, and almost all of them are closed on Saturdays as well. We used these findings to help create the models discussed in the next section.
In total, we created eight models based on our final dataset: one linear regression, four Gradient Boosted Regression models, one Extra Trees model, one Random Forest, and one Decision Tree. We utilized pipelines to standardize the process of transforming and splitting the data. The first step in preparing the data was to set a standard random seed to ensure that the data would be split the same way each time and that any random aspects of our models would stay constant. We then created two pipelines, one to scale the numeric variables and one to transform the categorical variables into a form usable by the model functions. All of our pipelines, scalers, and models are from the Scikit-Learn Python package. Once our data was in a suitable form, we moved on to building our models.
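The preprocessing setup described above can be sketched roughly as follows; the feature names, seed value, and synthetic data are illustrative stand-ins, not our actual columns:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(370)  # fixed seed so splits stay constant
n = 200
# Synthetic stand-in features (hypothetical names echoing our dataset).
X = pd.DataFrame({
    "A_HrsPct": rng.random(n),
    "C_HrsPct": rng.random(n),
    "BRNCH_CD": rng.choice(["W01", "W02", "W03"], n),
    "weekday": rng.choice(["Mon", "Tue", "Wed", "Thu", "Fri"], n),
})
y = 100 + 40 * X["A_HrsPct"] - 30 * X["C_HrsPct"] + rng.normal(0, 5, n)

# One transformer scales numeric columns, one one-hot encodes categoricals.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["A_HrsPct", "C_HrsPct"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["BRNCH_CD", "weekday"]),
])
model = Pipeline([("pre", pre), ("gbr", GradientBoostingRegressor(random_state=370))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=370)
model.fit(X_tr, y_tr)
print(round(model.score(X_te, y_te), 2))
```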
Since our response variable was a rate, we first fit a linear regression to serve as a baseline. The first model we focused on from there was a Gradient Boosted Regressor (GBR). GBR is an ensemble method that creates many weak models (trees) whose predictions are combined for better performance [3]. We started by fitting our data to this model with its default parameters: a learning rate of 0.1, 100 estimators, and a max depth of 3. For clarity, the learning rate shrinks the contribution of each tree, the number of estimators is the number of trees created, and the depth determines how many splits each tree makes [4]. The higher the number of estimators and the depth, the more closely the ensemble fits the training data, but we had to keep in mind that it also becomes more computationally expensive to fit. To tune this model, we performed a grid search that fit every possible combination of parameters on our data to find the best-performing set. We found that a learning rate of 0.3, 2000 estimators, and a max depth of 10 returned the best root mean squared error (RMSE). However, when we analyzed the fit of this model against the actual data, we noticed very high variability, as seen in Figure 11: Comparing the Old and New GBR Models after Depth Change. After researching what might cause high variability in GBR models, we found that the ideal max depth should be somewhere between 3 and 8 [5]. We kept the learning rate and number of estimators the same but changed the depth to 6, which returned a comparable RMSE and score while reducing the variability and fitting the actual data much better, as can be seen in Figure 11.
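The grid search step can be sketched as below; the data, seed, and grid values are stand-ins (our actual grid extended to 2000 estimators and a depth of 10), with scoring by negated RMSE as in the report:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in regression data; the real search ran over our warehouse features.
X, y = make_regression(n_samples=300, n_features=8, noise=10, random_state=370)

# A scaled-down grid: every combination of these values is fit and scored.
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=370),
    param_grid={
        "learning_rate": [0.1, 0.3],
        "n_estimators": [100, 300],
        "max_depth": [3, 6],
    },
    scoring="neg_root_mean_squared_error",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```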
In addition to the GBR, our client also suggested building an Extra Trees Regressor. An Extra Trees Regressor has a similar architecture to a random forest but chooses its splits randomly instead of searching for the optimal one, making it more computationally efficient [6]. We fit our data to this model with its default parameters: a max depth of none, meaning the nodes of a tree keep expanding until all the leaves contain fewer than the minimum number of samples required to split a node [7], and 100 estimators. We were surprised to see nearly the same performance as our tuned GBR, making this the second-best model and the only other model worth mentioning in terms of performance. Due to the amount of time we had spent tuning the GBR and fitting other models such as the random forest and decision tree, we did not perform a grid search to tune this model. Given the performance it returned with default parameters, however, we wonder whether a tuned version could have outperformed the GBR, which is one goal we hope to achieve in the future.
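A minimal sketch of fitting an Extra Trees Regressor with the default parameters described above, on stand-in data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

# Stand-in regression data, not our warehouse features.
X, y = make_regression(n_samples=300, n_features=8, noise=10, random_state=370)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=370)

# Defaults as described: 100 trees, max_depth=None (nodes expand until
# leaves hold fewer than min_samples_split samples), random split thresholds.
et = ExtraTreesRegressor(n_estimators=100, max_depth=None, random_state=370)
et.fit(X_tr, y_tr)
print(round(et.score(X_te, y_te), 2))
```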
Our goal, as set by our client, was to produce a model with an RMSE of less than 10, as that level of prediction accuracy would make it suitable for a production environment. Although we tried different models with various parameters, we were unable to reach the target RMSE (see Figure 6: Model Metrics for the table of model metrics). By our determination, our best models were the tuned GBR and the Extra Trees Regressor, with RMSEs of 15.95 and 15.97 respectively. Our other models, the Random Forest and Decision Tree, returned RMSEs of 16.50 and 21.98 respectively; since these models performed worse both in score and computationally, we prefer the GBR and Extra Trees. The comparison of the model prediction lines against the actual data in Figure 14 shows that these two models fit the best. We believe that with more specific and accurate data, these models could have performed even better and brought us closer to the target RMSE.
In terms of our features, one interesting finding from our models is the insignificance of the day of the week and the month. None of the models we built returned the day of the week or the month as an important feature, which contrasts with our initial expectation based on our analysis. Looking at the most important predictors among our best two models, the specific warehouse, the percentage of hours worked by cohort C, and the percentage of cases processed by cohort A are consistently the most important predictors of cases-per-man-hour. We determined this by analyzing a permutation importance plot, which randomly shuffles the values within a feature, breaking the relationship between that feature and the response variable, and measures the feature's importance to the model by the resulting drop in model score. The results of our permutation importance plot can be viewed in Figure 7: GBR Feature Importance. From this, we can infer that the specific branch of the warehouse is a major factor in predicting the cases-per-man-hour rate, as some warehouses may be bigger or have more people, leading the rate to differ among warehouses. Cohort also appears to be an important factor in warehouse productivity, with the best workers likely coming from cohort A, basing this assumption on our previous data analysis and what our client told us.
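The permutation procedure we describe can be sketched with scikit-learn's `permutation_importance`; the data here is synthetic, not our warehouse features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=370)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=370)
gbr = GradientBoostingRegressor(random_state=370).fit(X_tr, y_tr)

# Shuffle each feature column on held-out data and record the score drop;
# a large drop means the model leaned heavily on that feature.
result = permutation_importance(gbr, X_te, y_te, n_repeats=10, random_state=370)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```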
Reflecting on our work, an addition that would have benefited us greatly is more warehouse-specific data. Our client had hoped for a model of daily warehouse productivity that was not location/branch specific. With the data we were given, and with our models returning branch code as an extremely important predictor of cases-per-man-hour, this goal could not be accomplished without more warehouse data. This would entail factors such as the size of the warehouse, the number of people working per cohort, the average number of cases processed per day, and anything else that could help predict productivity. Having these additional pieces of information would have helped us create a model better catered to daily warehouse productivity. We could also have benefited from better and more consistent record-keeping, specifically fewer outliers and null values, as this would have allowed us to keep more observations in our data.
One goal we wish to pursue in the future is incorporating the sae data that we were given. Due to time constraints, we were not able to implement this data in our model. If we were to continue predicting past productivity, implementing it would be our next step in improving our predictions.
Another goal that we started and wish to accomplish in the future is to incorporate weather into our data, both to determine its effects on warehouse productivity and to use it to predict warehouse closures. We had already integrated weather data through an API that added features such as the amount of rain or snow, sunrise and sunset, wind speeds, temperatures, visibility, and many others at a daily level based on the location of each warehouse. Warehouse closures have a dramatic effect on productivity, and being able to predict when such an event may occur based on weather could greatly help in compensating and adjusting productivity metrics. The weather data could also have helped with productivity metrics for deliveries and shipping, as that area of the business has the most direct contact with outside environmental conditions. Overall, we are happy with the work we have done given the data we had at our disposal.
[1] https://www.e-grocery.com/industry-terms/stock-keeping-unit-sku/
[2] https://www.e-grocery.com/industry-terms/eaches/
[4] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
[5] https://inria.github.io/scikit-learn-mooc/python_scripts/ensemble_hyperparameters.html
[6] https://quantdare.com/what-is-the-difference-between-extra-trees-and-random-forest/#
[7] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html
| Model | RMSE | R2 | Model Score |
|---|---|---|---|
| Base Linear Regression | 20.62933 | 0.4879033 | 0.6580482 |
| Base GBR | 22.01128 | 0.0470882 | 0.6106993 |
| Improved GBR | 16.42754 | 0.7491966 | 0.7831598 |
| Improved GBR 2.0 | 16.36043 | 0.7401351 | 0.7849279 |
| Improved GBR 3.0 | 15.94973 | 0.7430484 | 0.7955904 |
| Extra Trees | 15.97468 | 0.7587561 | 0.7949504 |
| RFR | 16.50292 | 0.7253529 | 0.7811652 |
| Decision Tree | 21.97959 | 0.6119780 | 0.6118193 |
## (Plot output omitted: permutation-importance boxplots with the features c_cases, brnch_cd, a_hrspct, c_hrspct, and a_cases on the y-axis; see Figure 7: GBR Feature Importance.)
knitr::opts_chunk$set(include = F)
setwd("/home/risto/gits/newn/stat_370_final")
library(reticulate)
library(kableExtra)
library(tidyverse)
library(plotly)
if(!virtualenv_exists(envname = "./.venv")){
install_python(version="3.10.11")
virtualenv_create(envname = "./.venv", version = "3.10.11", packages = c("pandas==1.5", "numpy", "matplotlib", "scikit-learn==1.1.3", "joblib"))
}
use_virtualenv("./.venv")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
"""# Cases
* Change date to date_time object
* Create a new Date column for both datasets
* Allows merging on shared column
* Sort dates in ascending order
"""
sae = pd.read_csv('data/sae.csv')
cases = pd.read_csv("data/Cases.csv")
# change CLNDR_DT from object to datetime and store it in a new 'Date'
# column so both datasets share a common column for merging
cases['Date'] = pd.to_datetime(cases['CLNDR_DT'])
# sort from past to present using the parsed dates (string dates may sort wrong)
cases = cases.sort_values(by='Date')
## drop rows with missing values so that 'tot_each_day' and the zone
## ratios below are computed only over complete records
cases = cases.dropna()
## total eaches selected per day per branch, summed across zones
cases['Total_Each_Day'] = cases.groupby(['Date', 'BRNCH_CD'])['EACHES_SELECTED'].transform('sum')
## new dataframe in the order: Date, branch, zone, eaches selected, total eaches
cases4 = cases[['Date', 'BRNCH_CD', 'ZONE', 'EACHES_SELECTED', 'Total_Each_Day']].copy()
## Change index starting at 1, drop existing index to avoid adding as column
cases4 = cases4.reset_index(drop=True)
## ratio of eaches per zone: eaches selected in the zone / total eaches that day
cases4['ratio'] = cases4['EACHES_SELECTED'] / cases4['Total_Each_Day']
## split the ratio into one column per zone (dry, clr, frz); rows from the
## other zones get 0 as a placeholder so that aggregation by sum works
cases4['dry_ratio'] = np.where(cases4['ZONE'] == 'DRY', cases4['ratio'], 0)
cases4['clr_ratio'] = np.where(cases4['ZONE'] == 'CLR', cases4['ratio'], 0)
cases4['frz_ratio'] = np.where(cases4['ZONE'] == 'FRZ', cases4['ratio'], 0)
## sort dataframe by date and branch to see everything together
## check if math is correct
cases4 = cases4.sort_values(by=['Date', 'BRNCH_CD'], ascending = [True, True])
## collapse the multiple rows of the same branch and date into one row by
## summing; the zero placeholders have no effect on the ratios
## (numeric_only avoids concatenating the string ZONE column)
cases6 = cases4.groupby(['Date', 'BRNCH_CD', 'Total_Each_Day'], as_index = False).sum(numeric_only=True)
## drop eaches since same as total
## drop ratio since all ratios add up to 1 anyway
cases_EachPerDay = cases6.drop(columns=['EACHES_SELECTED', 'ratio'])
"""# Hours
* Repeat steps from cases
"""
hours = pd.read_csv("data/hours.csv")
hours['Date'] = pd.to_datetime(hours['REPORT_DATE'])
hours = hours.sort_values(by=['REPORT_DATE', 'BRNCH_CD'], ascending = [True, True])
hours = hours.drop(columns=['DIV_NBR', 'REPORT_DATE', 'FULL_MARKET_NAME'])
## drop rows where the cohort is 'UNKNOWN'
hours = hours[hours.COHORT != 'UNKNOWN']
## create a dataframe with columns in the desired order
hours_co = pd.DataFrame(hours, columns = ['Date', 'BRNCH_CD', 'COHORT', 'CASES_SELECTED', 'MEASURED_DIRECT_HRS'])
## per-day, per-branch totals of hours measured and cases selected across all cohorts
hours_co['Total_Hours'] = hours.groupby(['Date', 'BRNCH_CD'])['MEASURED_DIRECT_HRS'].transform('sum')
hours_co['Total_Cases'] = hours.groupby(['Date', 'BRNCH_CD'])['CASES_SELECTED'].transform('sum')
hours_co = hours_co.reset_index(drop = True)
## ratio of hours worked per cohort relative to total hours at that branch that day
hours_co['Hrs_Pct'] = hours_co['MEASURED_DIRECT_HRS'] / hours_co['Total_Hours']
## one hours-share column per cohort (A, B, C); 0 acts as a placeholder in the
## other cohorts' rows so that aggregation by sum works, as in cases
hours_co['A_HrsPct'] = np.where(hours_co['COHORT'] == 'A', hours_co['Hrs_Pct'], 0)
hours_co['B_HrsPct'] = np.where(hours_co['COHORT'] == 'B', hours_co['Hrs_Pct'], 0)
hours_co['C_HrsPct'] = np.where(hours_co['COHORT'] == 'C', hours_co['Hrs_Pct'], 0)
## ratio of cases selected per cohort relative to total cases at branch on specific day
hours_co['Cases_Pct'] = hours_co['CASES_SELECTED'] / hours_co['Total_Cases']
## same spread for the cases ratio: one column per cohort, 0 elsewhere
for co in ['A', 'B', 'C']:
    hours_co[f'{co}_Cases'] = hours_co['Cases_Pct'].where(hours_co.COHORT == co, 0.0)
hours_CoHrs = hours_co.groupby(['Date', 'BRNCH_CD', 'Total_Hours', 'Total_Cases'], as_index = False).agg('sum')
hours_CoHrs
"""# New Datasets"""
## Date = CLNDR_DATE
## BRNCH_CD = code referring to specific warehouse
## Total_Each_Day = sum of eaches selected by all zones on specific date
## dry_ratio = proportion of dry products relative to total eaches selected on specific date
## clr_ratio = proportion of cooler products relative to total eaches selected on specific date
## frz_ratio = proportion of freezer products relative to total eaches selected on specific date
hours_CohortHrsCases = pd.DataFrame(hours_CoHrs, columns = ['Date', 'BRNCH_CD',
'Total_Hours', 'Total_Cases',
'A_HrsPct',
'B_HrsPct', 'C_HrsPct',
'A_Cases', 'B_Cases','C_Cases'
])
## Date = REPORT_DATE
## BRNCH_CD = code referring to specific warehouse
## Total_Hours = sum of hours worked by all cohorts on specific date
## A_HrsPct = proportion of hours worked by Cohort A relative to total hours on specific date
## B_HrsPct = proportion of hours worked by Cohort B relative to total hours on specific date
## C_HrsPct = proportion of hours worked by Cohort C relative to total hours on specific date
## Total_Cases = Sum of all cases selected by all cohorts on specific date
## A_Cases = proportion of cases selected by Cohort A relative to total cases selected on specific date
## B_Cases = proportion of cases selected by Cohort B relative to total cases selected on specific date
## C_Cases = proportion of cases selected by Cohort C relative to total cases selected on specific date
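The per-cohort proportion columns described above are built with pandas' groupby/transform pattern. A minimal sketch on made-up hours data (branch codes and values are illustrative) shows why the shares within each branch-day sum to one:

```python
import pandas as pd

# Toy frame mimicking the hours data: two branches, three cohorts each
# (column names follow the real dataset; the numbers are made up).
toy = pd.DataFrame({
    "BRNCH_CD": ["2Z", "2Z", "2Z", "4C", "4C", "4C"],
    "COHORT": ["A", "B", "C", "A", "B", "C"],
    "MEASURED_DIRECT_HRS": [40.0, 30.0, 30.0, 10.0, 5.0, 5.0],
})

# transform('sum') broadcasts the per-branch total back onto every row,
# so the ratio is each cohort's share of the branch's hours.
toy["Total_Hours"] = toy.groupby("BRNCH_CD")["MEASURED_DIRECT_HRS"].transform("sum")
toy["Hrs_Pct"] = toy["MEASURED_DIRECT_HRS"] / toy["Total_Hours"]

# The cohort shares within each branch must sum to 1.
shares = toy.groupby("BRNCH_CD")["Hrs_Pct"].sum()
print(shares.tolist())
```

The same reasoning explains why the `ratio` column was safe to drop earlier: per group, the shares always add to 1.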
"""# Merge
* lost a lot of data :/
* dont merge yet?
"""
Foods = hours_CohortHrsCases.merge(cases_EachPerDay, how = 'inner', on = ['Date', 'BRNCH_CD'])
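Before settling on the inner join, the row loss can be audited with pandas' merge `indicator` flag, which tags each row by which side it came from. A toy sketch (the branch-days here are made up):

```python
import pandas as pd

# Two small frames standing in for the hours and cases datasets.
left = pd.DataFrame({"Date": ["d1", "d1", "d2"], "BRNCH_CD": ["2Z", "4C", "2Z"], "h": [1, 2, 3]})
right = pd.DataFrame({"Date": ["d1", "d2"], "BRNCH_CD": ["2Z", "2Z"], "c": [10, 30]})

# An outer join with indicator=True labels each row 'both', 'left_only', or
# 'right_only' -- the non-'both' rows are exactly what an inner join would drop.
audit = left.merge(right, how="outer", on=["Date", "BRNCH_CD"], indicator=True)
print(audit["_merge"].value_counts())
```

Counting `left_only` and `right_only` per branch would show which warehouses are driving the loss.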
#copy the merged hours/cases data before adding derived columns
Foods2 = Foods.copy()
#adding the response variable(s) to the dataset
Foods2["cases_hrs"] = Foods2['Total_Cases']/Foods2['Total_Hours'] #creates the dependent variable we are trying to measure
Foods2.replace([np.inf, -np.inf], np.nan, inplace=True) #converting all infinites to na's. will want to come up with a better solution later
Foods2 = Foods2.dropna() #drops all na's for now
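The replace-then-drop treatment above exists because a branch-day with zero measured hours makes the rate infinite. A minimal sketch with made-up rows:

```python
import numpy as np
import pandas as pd

# Three toy branch-days; the middle one has zero measured hours.
toy = pd.DataFrame({"Total_Cases": [500, 120, 0], "Total_Hours": [25.0, 0.0, 10.0]})

# Float division by zero yields inf rather than raising, so the bad row
# survives the division and must be caught afterwards.
toy["cases_hrs"] = toy["Total_Cases"] / toy["Total_Hours"]
toy.replace([np.inf, -np.inf], np.nan, inplace=True)
toy = toy.dropna()
print(len(toy))
```

Only the zero-hour row is removed; a genuinely idle day (0 cases over 10 hours) keeps its rate of 0.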
weekdays = {0: "Monday",
1: "Tuesday",
2: "Wednesday",
3: "Thursday",
4: "Friday",
5: "Saturday",
6: "Sunday"}
months = {1: "January",
2: "February",
3: "March",
4: "April",
5: "May",
6: "June",
7: "July",
8: "August",
9: "September",
10: "October",
11: "November",
12: "December"}
Foods2['weekday'] = Foods2['Date'].apply(lambda x: weekdays[x.weekday()]) #adds the weekday of the particular entry
Foods2['month'] = Foods2['Date'].apply(lambda x: months[x.month]) #add the reporting_date month as a separate variable
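As an aside, the weekday and month labels above can also come straight from pandas' datetime accessor, which avoids maintaining the two lookup dictionaries:

```python
import pandas as pd

# .dt.day_name() / .dt.month_name() return the spelled-out names directly,
# equivalent to the weekday/month dictionary lookups above.
dates = pd.Series(pd.to_datetime(["2022-01-03", "2022-07-09"]))
print(dates.dt.day_name().tolist())    # ['Monday', 'Saturday']
print(dates.dt.month_name().tolist())  # ['January', 'July']
```

Either approach yields identical columns; the accessor version is just less code to keep in sync.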
#adds the sae data to the main dataset
sae = pd.read_csv('data/sae.csv')
Foods3 = Foods2.merge(sae, how='outer', on='BRNCH_CD')
import random

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor
random_seed = 46
np.random.seed(random_seed)
random.seed(random_seed)
dataset = Foods3
dataset = dataset[dataset["cases_hrs"] <= 300]
dataset = dataset[dataset["Total_Hours"] >= 10]
#dropping all of the unneeded columns
to_drop = ["Date", "Total_Hours", "Total_Cases", "B_HrsPct", "B_Cases", "Total_Each_Day", "dry_ratio", "clr_ratio", "frz_ratio", "GO_LIVE_DATE", "LABEL_TYPE"]
dataset_build = dataset.drop(labels=to_drop, axis=1)
dataset_build.rename(columns={"BRNCH_CD":"brnch_cd", "A_HrsPct":"a_hrspct", "C_HrsPct":"c_hrspct", "A_Cases":"a_cases", "C_Cases":"c_cases"}, inplace=True)
#pipelines
numeric_features = ["a_hrspct", 'c_hrspct', 'a_cases', 'c_cases']
numeric_transformer = Pipeline(
steps=[("scaler", StandardScaler())]
)
categorical_features=['brnch_cd', 'weekday', 'month']
categorical_transformer = Pipeline(
steps=[
("encoder", OneHotEncoder())
]
)
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numeric_features),
("cat", categorical_transformer, categorical_features),
]
)
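The preprocessor above feeds every model the same transformed matrix: z-scored numerics plus one-hot encoded categoricals, concatenated column-wise. A self-contained sketch on a made-up two-column frame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Minimal version of the preprocessor: one numeric and one categorical column.
toy = pd.DataFrame({"a_hrspct": [0.2, 0.5, 0.8], "weekday": ["Monday", "Tuesday", "Monday"]})
pre = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[("scaler", StandardScaler())]), ["a_hrspct"]),
        ("cat", Pipeline(steps=[("encoder", OneHotEncoder())]), ["weekday"]),
    ]
)
out = pre.fit_transform(toy)
print(out.shape)  # (3, 3): 1 scaled numeric + 2 one-hot weekday columns
```

Because the encoder learns its categories at `fit` time, wrapping the preprocessor in each model pipeline (as done below) keeps the train/test encoding consistent.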
X_train, X_test, y_train, y_test = train_test_split(dataset_build.drop(labels="cases_hrs", axis=1),dataset_build['cases_hrs'], random_state=random_seed, train_size = .70)
scores = []
#linear regression
reg_pipe = Pipeline(
steps=[("preprocessor", preprocessor), ("model", LinearRegression())]
)
reg_pipe.fit(X_train, y_train)
scores.append(["Reg", reg_pipe.score(X_test, y_test)])
#gradient boosting regressor model
gbr_pipe = Pipeline(
steps=[("preprocessor", preprocessor), ("model", GradientBoostingRegressor(random_state=random_seed))]
)
gbr_pipe.fit(X_train, y_train)
scores.append(["GBR Init", gbr_pipe.score(X_test, y_test)])
#improved gbr
gbr_improved_pipe = Pipeline(
steps=[("preprocessor", preprocessor), ("model", GradientBoostingRegressor(verbose=1, n_estimators=1000, learning_rate=0.3, max_depth=10, random_state=random_seed, loss='squared_error'))]
)
gbr_improved_pipe.fit(X_train, y_train)
scores.append(["GBR Improved", gbr_improved_pipe.score(X_test, y_test)])
#improved gbr 2.0 - lower depth
gbr_improved_pipe_2 = Pipeline(
steps=[("preprocessor", preprocessor), ("model", GradientBoostingRegressor(verbose=1, n_estimators=1000, learning_rate=0.3, max_depth=5, random_state=random_seed, loss='squared_error'))]
)
gbr_improved_pipe_2.fit(X_train, y_train)
scores.append(["GBR Improved", gbr_improved_pipe_2.score(X_test, y_test)])
#improved gbr
## to reduce variablity, depth for gbr should be between 4 - 8
## n_estimators = 2000
## smaller learning rate = 0.05 - 0.15 is good
## 4 minutes
gbr_improved_pipe_3 = Pipeline(
    steps=[("preprocessor", preprocessor), ("model", GradientBoostingRegressor(verbose=1,
                                                                               n_estimators=2000,
                                                                               learning_rate=0.05,
                                                                               max_depth=6,
                                                                               random_state=random_seed,
                                                                               loss='squared_error'))]
)
gbr_improved_pipe_3.fit(X_train, y_train)
scores.append(["GBR Improved 3",gbr_improved_pipe_3.score(X_test, y_test)])
#Extra trees base model, 5 minutes
xtratree_pipe = Pipeline(
steps=[("preprocessor", preprocessor), ("model", ExtraTreesRegressor(verbose=1))]
)
xtratree_pipe.fit(X_train, y_train)
scores.append(["xtratree",xtratree_pipe.score(X_test, y_test)])
#baseline RFR, outperformed by extratrees regressor
randfor_pipe = Pipeline(
steps=[("preprocessor", preprocessor), ("model", RandomForestRegressor(verbose=1))]
)
randfor_pipe.fit(X_train, y_train)
scores.append(["RFR", randfor_pipe.score(X_test, y_test)])
#baseline Decision tree regressor
dectree_pipe = Pipeline(
steps=[("preprocessor", preprocessor), ("model", DecisionTreeRegressor())]
)
dectree_pipe.fit(X_train, y_train)
scores.append(["Decision Tree",dectree_pipe.score(X_test, y_test)])
#random forest grid search (wrapped in the preprocessing pipeline so the categorical columns get encoded)
from sklearn.model_selection import GridSearchCV, KFold
crossvalidation = KFold(n_splits=3, shuffle=True, random_state=1)
randfor_param = {
    'model__max_depth': [5, 10, 15],
    'model__n_estimators': [500, 1000, 1500]}
randfor_cv_pipe = Pipeline(steps=[("preprocessor", preprocessor), ("model", RandomForestRegressor(random_state=random_seed))])
randfor_search = GridSearchCV(randfor_cv_pipe, randfor_param, refit=True, verbose=3, cv=crossvalidation, scoring='neg_mean_squared_error')
randfor_search.fit(X_train, y_train)
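Once a search like the one above finishes, scikit-learn exposes the winning configuration and its cross-validated score on the fitted object. A self-contained toy run (the synthetic `X_demo`/`y_demo` data and the tiny grid are illustrative, sized to finish in seconds):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data: a known linear signal plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    {"max_depth": [3, 5]},
    cv=KFold(n_splits=3, shuffle=True, random_state=1),
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_params_ holds the winning grid point; best_score_ is the (negated) CV MSE.
print(search.best_params_)
print(round(-search.best_score_, 3))
```

With `scoring='neg_mean_squared_error'`, negating `best_score_` recovers the mean squared error, which is how the real search above should be read as well.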
from joblib import load, dump
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import set_config
import matplotlib.pyplot as plt
from sklearn.inspection import permutation_importance
#setting the random seed
random_seed = 46
np.random.seed(random_seed)
random.seed(random_seed)
#loading in the fitted models (saved with joblib's dump after training) and datasets
reg = load("models/lin_reg.joblib")
gbr_init = load("models/gbr_init.joblib")
gbr_imp = load("models/gbr_improved.joblib")
gbr_init_golive = load("models/gbr_init_golive.joblib")
gbr_imp_2 = load("models/gbr_improved_2.joblib")
gbr_imp_3 = load("models/gbr_improved_3.joblib")
xtratree = load("models/xtra_tree.joblib")
randfor = load("models/rfr.joblib")
dectree = load("models/dec_tree.joblib")
dataset=dataset_build
models = []
#testing the models
X_train, X_test, y_train, y_test = train_test_split(dataset.drop(labels="cases_hrs", axis=1),dataset['cases_hrs'], random_state=random_seed, train_size = .70)
X = X_test
y = y_test
#note: y_true goes first in the metric calls; r2_score is not symmetric in its arguments
#reg model
reg_pred = reg.predict(X)
models.append(["Base Linear Regression", mean_squared_error(y, reg_pred, squared=False), r2_score(y, reg_pred), reg.score(X, y)])
#gbr1 model
gbr_init_pred = gbr_init.predict(X)
models.append(["Base GBR", mean_squared_error(y, gbr_init_pred, squared=False), r2_score(y, gbr_init_pred), gbr_init.score(X, y)])
#improved gbr model
gbr_imp_pred = gbr_imp.predict(X)
models.append(["Improved GBR", mean_squared_error(y, gbr_imp_pred, squared=False), r2_score(y, gbr_imp_pred), gbr_imp.score(X, y)])
#improved gbr model 2.0
gbr_imp_2_pred = gbr_imp_2.predict(X)
models.append(["Improved GBR 2.0", mean_squared_error(y, gbr_imp_2_pred, squared=False), r2_score(y, gbr_imp_2_pred), gbr_imp_2.score(X, y)])
#improved gbr model 3.0
gbr_imp_3_pred = gbr_imp_3.predict(X)
models.append(["Improved GBR 3.0", mean_squared_error(y, gbr_imp_3_pred, squared=False), r2_score(y, gbr_imp_3_pred), gbr_imp_3.score(X, y)])
#xtra tree model
xtratree_pred = xtratree.predict(X)
models.append(["Xtra Tree", mean_squared_error(y, xtratree_pred, squared=False), r2_score(y, xtratree_pred), xtratree.score(X, y)])
#RFR model
randfor_pred = randfor.predict(X)
models.append(["RFR", mean_squared_error(y, randfor_pred, squared=False), r2_score(y, randfor_pred), randfor.score(X, y)])
#decision tree model
dectree_pred = dectree.predict(X)
models.append(["Decision Tree", mean_squared_error(y, dectree_pred, squared=False), r2_score(y, dectree_pred), dectree.score(X, y)])
#python model prez
models_pres = pd.DataFrame(models, columns=["Model", "RMSE", "R2", "Model Score"])
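The comparison table can be ranked directly rather than read by eye; a sketch with made-up scores (the numbers are illustrative, not results from the real run):

```python
import pandas as pd

# Toy version of models_pres: names match the real table, scores are invented.
demo = pd.DataFrame(
    [["Base GBR", 21.4, 0.83], ["Improved GBR 3.0", 15.2, 0.91], ["RFR", 17.8, 0.88]],
    columns=["Model", "RMSE", "R2"],
)

# Lower RMSE is better, so the best model sits in the first row after sorting.
best = demo.sort_values("RMSE").iloc[0]["Model"]
print(best)
```

Sorting `models_pres` the same way gives a quick headline for the presentation.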
gbr_result = permutation_importance(
gbr_imp_3, X_test, y_test, n_repeats=5, random_state=42, n_jobs=1
)
xtratree_result = permutation_importance(
xtratree, X_test, y_test, n_repeats=5, random_state=42, n_jobs=1
)
randfor_result = permutation_importance(
randfor, X_test, y_test, n_repeats=5, random_state=42, n_jobs=1
)
#!! pandas import/data wrangling for presentation stuff
master_data = pd.read_csv("data/master_dataset.csv")
gbr_model = load("models/gbr_improved.joblib")
hours = pd.read_csv("data/hours.csv")
gbr_model_old = load("models/gbr_improved.joblib")
gbr_model = load("models/gbr_improved_3.joblib")
xtratree_model = load("models/xtra_tree.joblib")
# randfor_model = load("../../notebooks/models/rfr.joblib")
#!! model stuff
model_data = master_data[master_data["cases_hrs"] <= 300]
model_data = model_data[model_data["Total_Hours"] >= 10]
#dropping all of the redundant columns
to_drop = ["Total_Hours", "Total_Cases", "B_HrsPct", "B_Cases", "Total_Each_Day", "dry_ratio", "clr_ratio", "frz_ratio", "LABEL_TYPE"]
model_data.drop(labels=to_drop, axis=1, inplace=True)
model_data.rename(columns={"BRNCH_CD":"brnch_cd", "A_HrsPct":"a_hrspct", "C_HrsPct":"c_hrspct", "A_Cases":"a_cases", "C_Cases":"c_cases"}, inplace=True)
dates = model_data.pop("Date")
go_live = model_data.pop("GO_LIVE_DATE")
#removing the real values
y = model_data.pop('cases_hrs')
X = model_data
# 1=gbr; 2=xtratree; 3=randfor (falls back to xtratree while rfr is not loaded); 4=old gbr
def model_pred(model):
    #copy so repeated calls do not keep mutating the shared feature frame
    df_pred = X.copy()
    #replacing the actual cpmh values with predictions from the chosen model
    if model == 1:
        df_pred['cases_hrs_pred'] = gbr_model.predict(X)
    if model == 2:
        df_pred['cases_hrs_pred'] = xtratree_model.predict(X)
    if model == 3:
        df_pred['cases_hrs_pred'] = xtratree_model.predict(X)
    if model == 4:
        df_pred['cases_hrs_pred'] = gbr_model_old.predict(X)
    df_pred['cases_hrs_real'] = y
    df_pred['Date'] = pd.to_datetime(dates)
    df_pred["GO_LIVE_DATE"] = go_live
    return df_pred
#!! creating a dictionary that will allow the branch names to appear in the dropdown menu
div_nm = hours['FULL_MARKET_NAME'].unique()
brnch = model_data['brnch_cd'].unique()
branches = {}
for b in brnch:
for d in div_nm:
if b in d:
branches[b] = d
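The substring test above maps branch codes to market names for the dropdown. A self-contained sketch (the market names here are made up) showing both the matching and its caveat: a code absent from every name is silently skipped, and a code contained in several names keeps only the last match.

```python
# Toy versions of the FULL_MARKET_NAME and brnch_cd values (invented names).
div_nm_demo = ["2Z NORTHEAST MARKET", "4C MIDWEST MARKET"]
brnch_demo = ["2Z", "4C", "9Q"]

# Same nested-loop substring match as above.
branches_demo = {}
for b in brnch_demo:
    for d in div_nm_demo:
        if b in d:
            branches_demo[b] = d
print(branches_demo)  # '9Q' has no matching market name, so it is dropped
```

If branch codes ever stop being unique prefixes of the market names, an explicit code-to-name lookup table would be safer than substring matching.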
knitr::opts_chunk$set(include = TRUE, echo = FALSE, message = FALSE, warning = FALSE)
foods = py$Foods3
model_data = py$dataset
hours = py$hours
#code about the outliers in the response variable
foods$consump = foods$Total_Cases/foods$Total_Hours
hist(foods$consump, main="Distribution of Cases-per-Man-Hour Pre-Wrangling")
#code about the total hours being extremely low
ggplot(data=foods, aes(x=Total_Hours)) + geom_histogram() + ggtitle("Distribution of Total Hours Pre-Wrangling")
#code about the differences in the datasets after dropping the outliers
food.less = foods[foods$Total_Hours > 10,]
food.less = food.less[food.less$consump < 300, ]
ggplot(data=food.less, aes(x=Total_Hours)) + geom_histogram() + ggtitle("Distribution of Total Hours Post-Wrangling")
ggplot(data=food.less, aes(x=consump)) + geom_histogram() + ggtitle("Distribution of Cases-per-Man-Hour Post-Wrangling")
#code about the difference in productivity rates between the cohorts
hours.consump = hours %>% dplyr::mutate(consump = CASES_SELECTED / MEASURED_DIRECT_HRS) %>% filter(MEASURED_DIRECT_HRS > 10) %>% filter(consump < 300)
ggplot(data=hours.consump, aes(y=consump, x=COHORT)) + geom_boxplot() + ggtitle("Cases-per-Man-Hour by Cohort Post-Wrangling")
barplot(table(foods$weekday), main="Number of Observations by Weekday")
model.pres = py$models_pres
kable(model.pres)
#gbr 3 Feature importance / Permutation importance plot
## MDI importances live on the transformed (one-hot expanded) feature space, so the
## labels must come from the pipeline's preprocessor, not the raw columns
gbr_features = gbr_imp_3.named_steps["preprocessor"].get_feature_names_out()
feature_importance = gbr_imp_3.steps[1][1].feature_importances_
sorted_idx = np.argsort(feature_importance)[-5:]  ## top five features
pos = np.arange(sorted_idx.shape[0]) + 0.5
fig = plt.figure(figsize=(8, 6))
plt.subplot(1, 2, 1)
plt.barh(pos, feature_importance[sorted_idx], align="center")
plt.yticks(pos, gbr_features[sorted_idx])
plt.title("Feature Importance (MDI)")
## permutation importance was computed on the raw columns, so X_test's names apply
sorted_idx = gbr_result.importances_mean.argsort()[-5:]
plt.subplot(1, 2, 2)
plt.boxplot(
    gbr_result.importances[sorted_idx].T,
    vert=False,
    labels=np.array(X_test.columns)[sorted_idx],
)
plt.title("Permutation Importance (test set)")
fig.tight_layout()
plt.show()
#Xtratree Feature importance / Permutation importance plot
xtratree_features = xtratree.named_steps["preprocessor"].get_feature_names_out()
feature_importance = xtratree.steps[1][1].feature_importances_
sorted_idx = np.argsort(feature_importance)[-5:]  ## top five features
pos = np.arange(sorted_idx.shape[0]) + 0.5
fig = plt.figure(figsize=(8, 6))
plt.subplot(1, 2, 1)
plt.barh(pos, feature_importance[sorted_idx], align="center")
plt.yticks(pos, xtratree_features[sorted_idx])
plt.title("Feature Importance (MDI)")
sorted_idx = xtratree_result.importances_mean.argsort()[-5:]
plt.subplot(1, 2, 2)
plt.boxplot(
    xtratree_result.importances[sorted_idx].T,
    vert=False,
    labels=np.array(X_test.columns)[sorted_idx],
)
plt.title("Permutation Importance (test set)")
fig.tight_layout()
plt.show()
#Random Forest Feature importance / Permutation importance plot
randfor_features = randfor.named_steps["preprocessor"].get_feature_names_out()
feature_importance = randfor.steps[1][1].feature_importances_
sorted_idx = np.argsort(feature_importance)[-5:]  ## top five features
pos = np.arange(sorted_idx.shape[0]) + 0.5
fig = plt.figure(figsize=(8, 6))
plt.subplot(1, 2, 1)
plt.barh(pos, feature_importance[sorted_idx], align="center")
plt.yticks(pos, randfor_features[sorted_idx])
plt.title("Feature Importance (MDI)")
sorted_idx = randfor_result.importances_mean.argsort()[-5:]
plt.subplot(1, 2, 2)
plt.boxplot(
    randfor_result.importances[sorted_idx].T,
    vert=False,
    labels=np.array(X_test.columns)[sorted_idx],
)
plt.title("Permutation Importance (test set)")
fig.tight_layout()
plt.show()
#productivity drop after label change
vline <- function(x = 0, color = "red") {
list(
type = "line",
y0 = 0,
y1 = 1,
yref = "paper",
x0 = x,
x1 = x,
line = list(color = color)
)
}
data = py$model_pred(0) #0 matches no model branch, so only the real values are attached
data = data[data$brnch_cd == "2Z",]
data$GO_LIVE_DATE = as.Date(data$GO_LIVE_DATE)
data$Date = as.Date(data$Date)
branches = py$branches
p = plot_ly(x=data$Date,
y=data$cases_hrs_real, type="scatter", mode="lines", name="real") %>%
layout(
title=paste("Productivity for Branch", branches["2Z"]), shapes=list(vline(data$GO_LIVE_DATE[1]))) %>% add_text(showlegend = F, x = c(data$GO_LIVE_DATE[1] +100), y = c(250),
text = c("Go-Live Date"))
p
#code for comparing the new and old gbr models
data = py$model_pred(4)
newn = py$model_pred(1)
data$new_predict = newn$cases_hrs_pred
data = data[data$brnch_cd == "4C",]
data$GO_LIVE_DATE = as.Date(data$GO_LIVE_DATE)
data$Date = as.Date(data$Date)
p = plot_ly(x=data$Date,
y=data$cases_hrs_real, type="scatter", mode="lines", name="real") %>%
add_trace(y = ~data$new_predict, name = 'new model',mode = 'lines') %>%
add_trace(y = ~data$cases_hrs_pred, name = 'old model',mode = 'lines') %>%
layout(
title=paste("Productivity for Branch", branches["4C"]), shapes=list(vline(data$GO_LIVE_DATE[1]))) %>% add_text(showlegend = F, x = c(data$GO_LIVE_DATE[1] +100), y = c(250), text = c("Go-Live Date"))
p
#code for comparing the best gbr model, the Xtra Trees model, and the actual values
xtra = py$model_pred(2)
data = py$model_pred(1)
data$xtra = xtra$cases_hrs_pred
data = data[data$brnch_cd == "3Y",]
data$GO_LIVE_DATE = as.Date(data$GO_LIVE_DATE)
data$Date = as.Date(data$Date)
p = plot_ly(x=data$Date,
y=data$cases_hrs_real, type="scatter", mode="lines", name="Real") %>%
add_trace(y = ~data$cases_hrs_pred, name = 'GBR',mode = 'lines') %>%
add_trace(y = ~data$xtra, name = 'Xtra Trees',mode = 'lines') %>%
layout(
title=paste("Productivity for Branch", branches["3Y"]), shapes=list(vline(data$GO_LIVE_DATE[1]))) %>% add_text(showlegend = F, x = c(data$GO_LIVE_DATE[1] +100), y = c(250), text = c("Go-Live Date"))
p
from dash import Dash, dcc, html, Input, Output
import pandas as pd
import plotly.graph_objects as go
from joblib import load
from datetime import datetime
#create a dashboard that shows these graphs for each warehouse
#!! pandas import/data wrangling for presentation stuff
TESTING = True  ## True = local repo paths while developing; False = containerized paths
if TESTING:
    master_data = pd.read_csv("../../notebooks/data/master_dataset.csv")
    hours = pd.read_csv("../../notebooks/data/hours.csv")
    gbr_model_old = load("../../notebooks/models/gbr_improved.joblib")
    gbr_model = load("../../notebooks/models/gbr_improved_3.joblib")
    xtratree_model = load("../../notebooks/models/xtra_tree.joblib")
    # randfor_model = load("../../notebooks/models/rfr.joblib")
else:
    master_data = pd.read_csv("/app/notebooks/data/master_dataset.csv")
    hours = pd.read_csv("/app/notebooks/data/hours.csv")
    gbr_model = load("/app/notebooks/models/gbr_improved_dock.joblib")
    ## the old GBR and Xtra Trees models would also need /app paths before the
    ## container can serve all three graphs; only the dockerized GBR ships now
#!! model stuff
model_data = master_data[master_data["cases_hrs"] <= 300]
model_data = model_data[model_data["Total_Hours"] >= 10]
#dropping all of the redundant columns
to_drop = ["Total_Hours", "Total_Cases", "B_HrsPct", "B_Cases", "Total_Each_Day", "dry_ratio", "clr_ratio", "frz_ratio", "LABEL_TYPE"]
model_data.drop(labels=to_drop, axis=1, inplace=True)
model_data.rename(columns={"BRNCH_CD":"brnch_cd", "A_HrsPct":"a_hrspct", "C_HrsPct":"c_hrspct", "A_Cases":"a_cases", "C_Cases":"c_cases"}, inplace=True)
dates = model_data.pop("Date")
go_live = model_data.pop("GO_LIVE_DATE")
#removing the real values
y = model_data.pop('cases_hrs')
X = model_data
# 1=gbr; 2=xtratree; 3=randfor (falls back to xtratree while rfr is not loaded); 4=old gbr
def model_pred(model):
    #copy so repeated calls do not keep mutating the shared feature frame
    df_pred = X.copy()
    #replacing the actual cpmh values with predictions from the chosen model
    if model == 1:
        df_pred['cases_hrs_pred'] = gbr_model.predict(X)
    if model == 2:
        df_pred['cases_hrs_pred'] = xtratree_model.predict(X)
    if model == 3:
        df_pred['cases_hrs_pred'] = xtratree_model.predict(X)
    if model == 4:
        df_pred['cases_hrs_pred'] = gbr_model_old.predict(X)
    df_pred['cases_hrs_real'] = y
    df_pred['Date'] = pd.to_datetime(dates)
    df_pred["GO_LIVE_DATE"] = go_live
    return df_pred
#!! creating a dictionary that will allow the branch names to appear in the dropdown menu
div_nm = hours['FULL_MARKET_NAME'].unique()
brnch = model_data['brnch_cd'].unique()
branches = {}
for b in brnch:
for d in div_nm:
if b in d:
branches[b] = d
#!! dash stuff
layout = html.Div([
html.H1("Warehouse Analytics"),
html.Div([
html.H3("Predicted Productivity vs Actual Productivity over time per Branch using GBR"),
dcc.Dropdown(branches, "2Z", id='dropdown-pred-real_gbr'),
dcc.Graph(id="pred_real_gbr"),
]),
html.Div([
html.H3("Predicted Productivity vs Actual Productivity over time per Branch using XtraTrees"),
dcc.Dropdown(branches, "2Z", id='dropdown-pred-real_xtra'),
dcc.Graph(id="pred_real_xtra"),
]),
html.Div([
html.H3("Predicted Productivity vs Actual Productivity over time per Branch using Old GBR"),
dcc.Dropdown(branches, "2Z", id='dropdown-pred-real_oldgbr'),
dcc.Graph(id="pred_real_oldgbr"),
])
])
if TESTING:
app = Dash(__name__)
app.layout = layout
"""Graphing real values over predicted values using GBR"""
@app.callback(
Output("pred_real_gbr", "figure"),
Input("dropdown-pred-real_gbr", "value"),
)
def display_gbr(value):
df = model_pred(1)
df = df[df['brnch_cd'] == value]
fig = go.Figure([
go.Scatter(x=df['Date'], y=df['cases_hrs_pred'], name="Predicted Values"),
go.Scatter(x=df['Date'], y=df['cases_hrs_real'], name="Real Values")
])
    fig.add_vline(x=df["GO_LIVE_DATE"].iloc[0], line_width=3)  ## mark the label-change go-live date by name, not position
fig.add_vline(x=datetime(2020, 3, 1), line_width=3, line_color="green")
return fig
"""Graphing real values over predicted values using XtraTrees"""
@app.callback(
Output("pred_real_xtra", "figure"),
Input("dropdown-pred-real_xtra", "value"),
)
def display_xtra(value):
df = model_pred(2)
df = df[df['brnch_cd'] == value]
fig = go.Figure([
go.Scatter(x=df['Date'], y=df['cases_hrs_pred'], name="Predicted Values"),
go.Scatter(x=df['Date'], y=df['cases_hrs_real'], name="Real Values")
])
    fig.add_vline(x=df["GO_LIVE_DATE"].iloc[0], line_width=3)  ## mark the label-change go-live date by name, not position
fig.add_vline(x=datetime(2020, 3, 1), line_width=3, line_color="green")
return fig
"""Graphing real values over predicted values using Old GBR"""
@app.callback(
Output("pred_real_oldgbr", "figure"),
Input("dropdown-pred-real_oldgbr", "value"),
)
def display_oldgbr(value):
df = model_pred(4)
df = df[df['brnch_cd'] == value]
fig = go.Figure([
go.Scatter(x=df['Date'], y=df['cases_hrs_pred'], name="Predicted Values"),
go.Scatter(x=df['Date'], y=df['cases_hrs_real'], name="Real Values")
])
    fig.add_vline(x=df["GO_LIVE_DATE"].iloc[0], line_width=3)  ## mark the label-change go-live date by name, not position
fig.add_vline(x=datetime(2020, 3, 1), line_width=3, line_color="green")
return fig
if TESTING:
app.run_server(debug=True)